1. INTRODUCTION

There is currently a scarcity of aerial imaging data with pixel-level annotations for semantic
segmentation, because labeling several types of objects in high-resolution images is a time-consuming
process. This section covers an initial method of hand-labeling data, the final datasets that were
used, and the necessary preparations [1].

The Semantic Drone Dataset focuses on improving the safety of autonomous drone flying and landing
procedures through semantic understanding of urban scenes. The imagery depicts more than 20 buildings
and was captured from a nadir (bird's eye) view at a height of 5 to 30 meters above the
ground [2]-[4]. Images with a resolution of 6,000x4,000 px were captured with a high-resolution
(24 Mpx) camera [5], [6]. The training set has 400 publicly available images, whereas the test set contains
200 private images. The object classes in the images are listed in Table 1 and Figure 1.

The lack of adequately annotated aerial images led to the creation of a new dataset as a first step. The
Institut für Maschinelles Sehen und Darstellen, Graz, provided this new dataset, which consisted of randomly
modified images. The procedure used to obtain the data was as follows: 1,000 images were selected
randomly from the flight test database. A Microsoft Surface tablet and the Image Labeler from the MATLAB
toolbox were used to label each of these images. The labeling used three classes:
buildings, roads, and clouds. Figure 2 shows an example of one of the many labeled images.
Figure 2(a) depicts a settlement that was selected as a test case to evaluate the developed method; it
represents the area of interest to which the method was applied and shows features such as buildings,
roads, and other structures. Figure 2(b) presents the processed street view derived from Figure 2(a).
This processed image emphasizes the street-level perspective and highlights the elements relevant to the
assessment, giving a clearer picture of the attributes and conditions present within the settlement.


Table 1. List of objects in images 


Object name   Object name   Object name   Object name
Unlabelled    Door          Bicycle       Rocks
Paved-area    Fence         Tree          Pool
Dirt          Fence-pole    Bald-tree     Vegetation
Grass         Person        Ar-marker     Roof
Gravel        Dog           Obstacle      Wall
Water         Car           Conflicting   Window
Figure 1. List of objects in images


Figure 2. Sample of initial hand-label approach (a) image block and (b) streets labeled as lines in the images 


2. METHOD 

This section explains the research chronologically, including the research design, the research procedure
(in the form of algorithms, pseudocode, or other), and how testing and data acquisition were carried
out [2]-[4]. The description of the course of the research is supported by references so that the explanation
can be accepted scientifically [5], [6]. Figures 1 and 2 and Table 1 are presented centered, as shown, and are
cited in the manuscript [2], [7]-[12]. The settlement curves produced at SG1 are illustrated in Figure 2(a)
and those produced at SG2 in Figure 2(b).


2.1. Pre-processing 

Computational limits, chiefly memory, prevent full-sized high-resolution imagery from being fed into a
CNN and trained on efficiently [7]-[9]. This necessitates preprocessing the images into smaller augmented
patches that are used to train and test the models [10]. The final prediction mask for each test image is
reassembled from its patches at the original size [11]. The spatial resolutions were also normalized for use
in the final integrated buildings dataset. Because it is a popular resolution for semantic segmentation and
suited the available computing resources, the input image crops were chosen to have a resolution of 224x224
pixels. Image data generators were built for use with Keras; at each step of an epoch they yield a batch of
32 randomly augmented 224x224 crops from the high-resolution images. The random augmentations used in this
project are horizontal flips, vertical flips, and rotations. The "mirror" fill mode was employed, which fills
the vacant areas created by a rotation with a mirrored copy of the image. Because aerial imagery has no
single canonical orientation, these augmentations let the model learn from diverse perspectives. They also
create additional artificial training data from which the model can learn to generalize patterns.
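The patch generation described above can be sketched in plain Python. This is a minimal illustration, not the Keras generator used in the project: the function names (`random_crop`, `augment`, `patch_batches`) are hypothetical, images are represented as nested lists, and rotations are restricted to multiples of 90 degrees so that no fill mode is needed (the project itself used arbitrary rotations with the "mirror" fill mode).

```python
import random

def random_crop(image, size, rng):
    """Return a size x size crop taken at a random position of a 2-D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def rotate90(patch):
    """Rotate a patch by 90 degrees clockwise."""
    return [list(column) for column in zip(*patch[::-1])]

def augment(patch, rng):
    """Apply random horizontal/vertical flips and a random 90-degree rotation."""
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]                   # vertical flip
    for _ in range(rng.randrange(4)):         # 0, 90, 180, or 270 degrees
        patch = rotate90(patch)
    return patch

def patch_batches(images, crop_size=224, batch_size=32, seed=0):
    """Yield endless batches of randomly augmented square crops."""
    rng = random.Random(seed)
    while True:
        yield [augment(random_crop(rng.choice(images), crop_size, rng), rng)
               for _ in range(batch_size)]

# Small usage example: a 10x12 toy image, 8x8 crops, batches of 4.
toy_image = [[r * 12 + c for c in range(12)] for r in range(10)]
first_batch = next(patch_batches([toy_image], crop_size=8, batch_size=4))
```

With the defaults (`crop_size=224`, `batch_size=32`) the generator mirrors the batch shape described in the text; in a real pipeline the label mask would have to be cropped and transformed with exactly the same random parameters as the image.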


2.2. Data pre-processing 


The images were subjected to data augmentation before the training step [12]-[14]. Data augmentation
was aimed at increasing the size of the dataset and improving the model's performance and its capacity to
generalize patterns. The pixel values were additionally normalized to lie between 0 and 1, with the goal of
reaching the loss-optimized value with gradient descent in fewer epochs. Data normalization enhances
learning and generally improves the convergence speed. Image flipping can be done vertically,
horizontally, or both. However, not all frameworks support vertical flips [15], [16]. A vertical flip is
equivalent to rotating an image by 180° and then flipping it horizontally. A horizontal flip reverses the
order of the pixel columns of an image (mirroring it left to right), while a vertical flip reverses the
order of the pixel rows (mirroring it top to bottom).
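As a small illustration of the two operations above, the following sketch scales pixel values to the range 0 to 1 and checks that a vertical flip equals a 180° rotation followed by a horizontal flip. The helper names and the assumption of 8-bit pixels (maximum value 255) are illustrative and not taken from the project code.

```python
def normalize(image, max_value=255.0):
    """Scale pixel values from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]

def horizontal_flip(image):
    """Reverse the order of the columns (mirror left to right)."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Reverse the order of the rows (mirror top to bottom)."""
    return image[::-1]

def rotate180(image):
    """Rotate by 180 degrees: reverse both the rows and the columns."""
    return [row[::-1] for row in image[::-1]]

image = [[0, 51, 102],
         [153, 204, 255]]
scaled = normalize(image)

# A vertical flip is the same as a 180-degree rotation plus a horizontal flip.
same = horizontal_flip(rotate180(image)) == vertical_flip(image)
```

This identity is why a framework that offers only rotations and horizontal flips can still produce vertically flipped images.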


2.3. Cropping and image normalization 

A region of the original image is selected at random and then resized to the original image size.
This technique is commonly called random cropping. Image normalization is an important step for
training because deep neural networks are trained by defining a cost function and optimizing it. To let the
optimizer minimize the cost function in a smaller number of epochs, it is necessary to apply
normalization or standardization; for images, normalization is the preferred approach [7]-[19].
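Random cropping as described above can be sketched as follows. The function name `random_crop_resize` is hypothetical, and nearest-neighbour resampling is assumed for the resize step because the text does not state which interpolation was used.

```python
import random

def random_crop_resize(image, crop_h, crop_w, rng):
    """Crop a random crop_h x crop_w region and resize it to the input size."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    # Nearest-neighbour resize: map every output pixel back to a crop pixel.
    return [[crop[r * crop_h // height][c * crop_w // width]
             for c in range(width)]
            for r in range(height)]

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
resized = random_crop_resize(image, crop_h=2, crop_w=2, rng=rng)
```

Here a 2x2 region of a 4x4 image is enlarged back to 4x4, so every cropped pixel appears as a 2x2 block in the result.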


2.4. Gradient descent algorithm 

This is a search strategy for finding a local minimum of a differentiable function, such
as a loss function. The gradient descent algorithm (GDA) is mostly used in machine learning (ML) to determine
the parameters/weights that minimize a cost function; the update rule is shown in (1) [20].
The following steps are repeated until convergence is reached:
— Calculate the parametric changes as a function of the learning rate based on the gradient. 
— Use the updated parameter value to recalculate the new gradient. 
— Check for the termination criterion, else revert to step one. 
In (1), θ denotes the parameter being optimized, α the learning rate, and J(θ) the cost function.


$$\theta := \theta - \alpha \frac{\partial}{\partial \theta} J(\theta) \qquad (1)$$
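The update loop can be sketched in a few lines of Python. This is a minimal illustration on a one-dimensional toy cost J(θ) = (θ - 3)^2, whose gradient is 2(θ - 3); the function name, the tolerance-based stopping rule, and the toy cost are assumptions made for the example.

```python
def gradient_descent(grad, theta, learning_rate=0.1, tolerance=1e-8,
                     max_iters=10_000):
    """Repeat theta := theta - learning_rate * grad(theta) until the step is tiny."""
    for _ in range(max_iters):
        step = learning_rate * grad(theta)  # parameter change from the gradient
        theta -= step                       # update the parameter
        if abs(step) < tolerance:           # termination criterion
            break
    return theta

# Toy cost J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and minimum at 3.
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

With a learning rate of 0.1 each iteration shrinks the distance to the minimum by a factor of 0.8, so the loop stops long before `max_iters` is reached.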


As a configurable hyperparameter, the learning rate takes a small positive value between 0.0 and 1.0; it is
used when training neural networks (NNs) (note that logistic regression is an NN with just one neuron) [21], [22].
The learning rate determines how quickly the model adapts to the problem. Several modifications of the
gradient descent algorithm make it workable for powerful deep neural networks [23], [24]. The GDA has a few
drawbacks, chiefly the number of computations required in each iteration. For instance, assume a case of
20,000 data points with 20 features; the sum of squared residuals then contains one term per data point
(20,000 terms in this case), and the derivative of this function must be computed with respect to each of
the features. This requires 20,000x20=400,000 computations in each iteration. Over 1,000 iterations, this
amounts to 400,000x1,000=400,000,000 computations to run the GDA to completion, which is a considerable
overhead. Hence, the GDA is not usually fast on huge datasets, which led to the development of stochastic
gradient descent (SGD). The term “stochastic” here means “random”, which comes into play when data points
are selected at each step for the calculation of the derivatives. SGD picks one data point at random from
the entire dataset per iteration to minimize the computation requirements [25].
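To make the difference concrete, the sketch below fits a straight line y = w*x + b with stochastic gradient descent, using one randomly chosen data point per update. The function name, the toy data, and the hyperparameter values are assumptions for illustration; setting `batch_size` to a value greater than 1 averages the gradient over several samples per update.

```python
import random

def sgd_linear_fit(xs, ys, learning_rate=0.05, epochs=200, batch_size=1, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data points in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # The gradient uses only the samples in this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each update here costs a handful of operations regardless of the dataset size, whereas a full-batch update would touch every data point.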

In stochastic gradient descent, only one data point is fed in at each iteration, which makes training very
slow. To speed it up, the data are grouped into mini-batches of, for example, 16, 32, 48, or 64 samples;
training on batches makes the process faster and reduces the computation as well [26]-[28]. The Adam
optimizer combines two gradient descent techniques, momentum and root mean squared propagation (RMSprop).
Momentum speeds up the GDA by taking the ‘exponentially weighted average’ of the gradients into account;
because it uses these averages, the algorithm tends to converge rapidly towards the minima. RMSprop
was proposed by Geoffrey Hinton as a gradient-based NN training approach. The gradients of complex
functions such as NNs tend to either vanish or explode as the data propagate through the function. RMSprop
was created as a stochastic mini-batch learning algorithm that addresses this problem by normalizing the
gradient with a moving average of squared gradients. This normalization equalizes the step size, lowering it
for large gradients to prevent explosion and raising it for small gradients to avoid vanishing. Simply
put, RMSprop treats the learning rate as an adaptive parameter rather than a fixed hyperparameter, so the
effective learning rate changes over time.
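The way Adam combines the two ideas can be sketched for a single scalar parameter. The function name and the toy cost are assumptions; the decay rates 0.9 and 0.999 and the bias-correction terms follow the commonly used formulation of Adam rather than anything stated in this paper.

```python
import math

def adam_minimize(grad, theta, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam, given its gradient function."""
    m = 0.0  # momentum term: moving average of gradients
    v = 0.0  # RMSprop term: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # correct the bias from zero initialization
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled down when recent squared gradients are large.
        theta -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy cost J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_adam = adam_minimize(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

The division by `sqrt(v_hat)` is the RMSprop-style normalization, and the use of `m_hat` in place of the raw gradient is the momentum part.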


2.5. Convolution layer 

Convolutional layers are used to produce a larger receptive field, which allows the model to recognize
more features of the input images. Examining every pixel in an image is a simple way to process it; however,
this can slow the training process dramatically. Because image pixels are sparse, there may be many
zero-valued pixels that carry no feature of interest. Convolution therefore applies a filter to an image and
slides it across the image, narrowing the underlying image down to distinct features that are specific
to the object under consideration [29].

A convolutional layer is made up of a series of filters, each of which has its own set of parameters
that must be learned. The filter is slid over the input's width and height, and the dot products between the
input and the filter are calculated at each spatial position before being fed into an activation function. The
convolutional layer's output volume is formed by stacking the activation maps of all filters along the depth
dimension [30].
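The sliding dot product can be sketched for a single-channel image and a single filter. The function names and the edge-detecting kernel are illustrative assumptions; no padding and a stride of 1 are assumed, and, as in most deep learning frameworks, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_h) for j in range(k_w))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(feature_map):
    """Activation function applied to every value of the feature map."""
    return [[max(0, value) for value in row] for row in feature_map]

# A 4x4 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1]] * 4
# A 2x2 filter that responds to a dark-to-bright transition from left to right.
kernel = [[-1, 1],
          [-1, 1]]
activation_map = relu(conv2d(image, kernel))
```

The resulting 3x3 activation map is non-zero only in its middle column, where the filter sits on the edge. A layer with several filters would produce one such map per filter and stack them along the depth dimension.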

By substituting the feature map output with a summary statistic of nearby outputs, the pooling layer
compresses an image. For instance, max-pooling uses the maximum statistic to decide the result of each
sliding window, leading to a down-sampled feature map with fewer redundant pixels,
faster training, and lower memory use. In average pooling, the result of the sliding window is computed using
the average statistic, so the feature map is averaged before it is transferred to the next layer.
Therefore, the use of a pooling layer when building a deep neural network (DNN) can significantly reduce
the amount of information conveyed by an image [31].
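Both pooling variants can be sketched with one helper that takes the summary statistic as an argument. The names `pool2d` and `mean` are hypothetical, and non-overlapping windows (stride equal to the window size) are assumed.

```python
def pool2d(feature_map, size=2, reduce=max):
    """Summarize each non-overlapping size x size window with `reduce`."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[reduce([feature_map[r * size + i][c * size + j]
                     for i in range(size) for j in range(size)])
             for c in range(cols)]
            for r in range(rows)]

def mean(values):
    """Average statistic used for average pooling."""
    return sum(values) / len(values)

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 3, 7, 8]]
max_pooled = pool2d(feature_map, size=2, reduce=max)   # [[4, 2], [3, 8]]
avg_pooled = pool2d(feature_map, size=2, reduce=mean)  # [[2.5, 1.0], [1.0, 6.5]]
```

In both cases the 4x4 feature map is reduced to 2x2, so the next layer processes a quarter of the values.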

The softmax function is used mostly in problems that involve multiclass classification because it
compresses real-valued inputs into the range [0, 1] and ensures that the output probabilities
sum to 1 [32]. The sigmoid activation function likewise compresses all input values
into the range [0, 1]. Input values >0 produce a result >0.5, while those <0 produce
results <0.5; an input of 0 produces a result equal to 0.5 [32]. The sigmoid
function is commonly used as the final output layer in binary classification problems [33].
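The two output functions discussed above can be written in a few lines. This is a minimal sketch using only the standard library; subtracting the maximum score inside `softmax` is a common numerical-stability step and is not described in the text.

```python
import math

def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Map a real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

class_probabilities = softmax([2.0, 1.0, 0.1])  # largest score gets the largest probability
midpoint = sigmoid(0.0)                          # equals 0.5
```

Softmax is applied to a vector of class scores, while the sigmoid is applied to a single score, which is why the latter suits binary classification outputs.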

The dropout process is a simple one: adding a dropout step after a neural network layer removes a random
subset of that layer's neurons. The chance of dropping a neuron is determined by the predefined dropout
rate, p; for instance, if p=0.5, each neuron has a 0.5 chance of being dropped before the layer output is
fed to the next layer. The more neurons are dropped, the greater the regularization effect. The
contribution of the dropped neurons to the activation of downstream neurons is temporarily eliminated from
the feed-forward path during the training phase, and no weight updates are applied to them on the backward
pass. However, the more neurons are dropped, the less training signal reaches the subsequent layer, and this
can cause underfitting. In practice, hidden layers typically use a dropout rate p between 0.2 and 0.5, while
the input layer uses 0.2.
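A dropout step can be sketched as a function applied to a layer's activations. The rescaling of the surviving activations by 1/(1 - p), often called inverted dropout, is an assumption borrowed from common framework practice and is not described in the text; the function name and signature are likewise illustrative.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - rate
    # Survivors are scaled by 1 / keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10_000
dropped = dropout(layer_output, rate=0.5, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)  # close to the rate of 0.5
```

Because the mask is drawn anew on every call, a different random subset of neurons is silenced in each training step.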

The training of DNNs is challenging for a variety of reasons, including the fact that the distribution of
the inputs from prior layers can change when weights are updated. Batch normalization is a technique for
standardizing the inputs to a layer, and it can be applied either to the activations of a prior layer or to
the direct inputs. It speeds up training, in some cases halving the number of epochs or more, and provides
some regularization, which reduces generalization error.
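The standardization step at the core of batch normalization can be sketched for one feature across a mini-batch. The scale `gamma`, shift `beta`, and the small constant `eps` follow the usual formulation and are assumptions here; the running statistics used at inference time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After this step the feature has approximately zero mean and unit variance within the batch, whatever scale the previous layer produced.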


3. CONCLUSION 

A model that fits the training data well but performs poorly on unobserved data is said to be
overfitted. For instance, this occurs when a 3-D function is used to fit a small dataset generated by a 1-D
function. In deep learning (DL), a model with high capacity, that is, one that can learn many function
mappings from input to output, can learn too well and overfit the training dataset; this happens mostly when
the function makes predictions that fit too closely to the data points. For this research, the first
material needed for processing is the unmanned aerial vehicle (UAV) remote sensing image; however, the data
acquired from the camera sensor come with several disadvantages. Hence, the first step is to apply some
transformations before the labeling and training phases. Furthermore, a CNN was used to process and extract
road information from the UAV images. Owing to the huge number of parameters in the CNN, the result cannot
be assured if the model is applied to actual road detection while these parameters are still randomly
initialized. As a result, the network parameters must be trained before the CNN is used in the real world.
The initial stage in evaluating the effect of network training is to train and test on the manually labeled
samples. The application area of CNNs is limited by the enormous number of samples they require.
In this paper, manual sample data labeling fell well short of the number of samples required for network
training. Previous studies have demonstrated the efficacy of data augmentation in the training of network
models. As a result, samples for network training were added based on the real scene via data augmentation.